-
Notifications
You must be signed in to change notification settings - Fork 1.1k
[Variant] Preserve Field throughout Variant shredding flow
#8673
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| value: Option<&BinaryViewArray>, | ||
| index: usize, | ||
| ) -> Variant<'a, 'a> { | ||
| let data_type = typed_value.data_type(); | ||
| let (_typed_value_field, typed_value_column) = typed_value; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that we have the field information, we can check for extension types like: e21dc1b
scovich
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking a stab at this. I'm not immediately sure how to react, so I just left a couple high level comments while I stew on it more.
| .typed_value_field() | ||
| .unwrap() | ||
| .1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just glancing at this code in isolation, I would guess that .0 is the field and .1 is the array?
But that requires knowing context (ie of this PR)
If this usage will show up often, is it worth returning a newtype instead of a tuple?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes , please, let's return a type with real field names and documentation
| let typed_value = if let Some(typed_value_array) = typed_value.clone() { | ||
| let field_ref = Arc::new(Field::new( | ||
| "typed_value", | ||
| typed_value_array.data_type().clone(), | ||
| true, | ||
| )); | ||
| builder = builder.with_field(field_ref.clone(), typed_value_array.clone()); | ||
|
|
||
| Some((field_ref, typed_value_array)) | ||
| } else { | ||
| None | ||
| }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this is just Option::map (no ? to complicate things)
| let typed_value = if let Some(typed_value_array) = typed_value.clone() { | ||
| let field_ref = Arc::new(Field::new( | ||
| "typed_value", | ||
| typed_value_array.data_type().clone(), | ||
| true, | ||
| )); | ||
| builder = builder.with_field(field_ref.clone(), typed_value_array.clone()); | ||
|
|
||
| Some((field_ref, typed_value_array)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because of the dual typing (one in field and one in array), this code is technically no longer infallible -- the data types could disagree. Not sure the best way to address that issue as long as we're passing a full-blown field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, this logic is duplicated with the variant array constructor above.
I think one thing I'm debating is the coupling between the |
| .typed_value_field() | ||
| .unwrap() | ||
| .1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes , please, let's return a type with real field names and documentation
| @@ -708,6 +730,9 @@ impl From<ShreddedVariantFieldArray> for StructArray { | |||
| } | |||
| } | |||
|
|
|||
| pub type TypedValue = (FieldRef, ArrayRef); | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This combination of an Array (which only has a DataType) and a Field (which has the DataType and metadata information) has come up recently upstream as @paleolimbot is working extension types through DataFusion as well in
We are still looking for a good pattern to use.
I wonder if something like this is too crazy
/// A value that has associated Arrow field infomration
struct TypedValue<T> {
value: T, // could be ArrayRef, could be ScalarValue
field: FieldRef,
}🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably struct TypedArrayRef { value: ArrayRef, field: FieldRef } is best for now (keeping the FieldRef-is-a-type bit opaque). Using FieldRefs as types in DataFusion has mostly led to a lot of cloned metadata and heterogeneity with respect to how the name/metadata/nullability should propagate (or not) but it does seem to be the path of least resistance.
28bd6bb to
398335b
Compare
398335b to
ca4a0ae
Compare
3cf96c8 to
bbb681c
Compare
Check for field metadata when accessing typed value
091eb35 to
d96ef3d
Compare
friendlymatthew
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Self review
| // note: these methods below make me want to impl Array for TypedArrayRef... | ||
| pub fn slice(&self, offset: usize, length: usize) -> Self { | ||
| let Self { inner, field } = self; | ||
|
|
||
| Self { | ||
| inner: inner.slice(offset, length), | ||
| field: Arc::clone(field), | ||
| } | ||
| } | ||
|
|
||
| pub fn is_valid(&self, index: usize) -> bool { | ||
| self.inner.is_valid(index) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious what people think. Since TypedArrayRef owns the inner ArrayRef, it'll be pretty easy to get TypedArrayRef to also impl Array.
I don't have a strong opinion, but it feels weird to rewrite methods that exist on a trait
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also wonder if we want to find a better home for this struct. I suspect a TypedArrayRef would be useful in various parts of the library, not just the variant-related work 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @paleolimbot
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This definitely solves the problem of writing APIs that need to propgagate extension metadata without constantly pairing a FieldRef. It may be worth starting with the internal struct and the methods like you have here and moving to a trait when there's a case that arises elsewhere that also needs that concept? (I'm not very good at predicting where things should go inside the arrow-rs crates)
| impl PartialEq for TypedArrayRef { | ||
| #[allow(clippy::op_ref)] | ||
| fn eq(&self, other: &Self) -> bool { | ||
| &self.inner == &other.inner && self.field == other.field | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was forced to add references to both operands (removing it fails to compile). Spent a lot of time this morning trying to fix/investigate this issue
Weirdly, using a tuple like I did in a prior iteration of this PR would compile 🤔
For those curious:
Here's a repro: https://play.rust-lang.org/?version=stable&mode=debug&edition=2024&gist=c868bf8f2f8e2b699f9a84b55a89cd71
And here's a more reduced version: https://play.rust-lang.org/?version=stable&mode=debug&edition=2024&gist=224ffe1076c24e1c8db5d4a8f53ca2fc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have vague memories of hitting head-scratchy bugs in comparison-wielding code like this, and was informed there are actually rustc bugs involved. Might it be rust-lang/rust#31740?
|
|
||
| #[test] | ||
| fn test_fixed_size_binary_without_uuid_extension() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's a test that will err when trying to access a FixedSizeBinary(16) without a Uuid extension type
| #[test] | ||
| fn test_fixed_size_binary_with_uuid_extension() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's a test case that will pass since we properly annotated the Uuid extension type to its field metadata
|
I hope to get to this soon, but am a bit buried at work this week and last... |
scovich
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like a lot of potentially unnecessary churn?
If we can't eliminate it with Deref etc, might we consider a separate PR that separates the churn from functional changes?
| // note: these methods below make me want to impl Array for TypedArrayRef... | ||
| pub fn slice(&self, offset: usize, length: usize) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we could actually leverage Array::slice because it returns Arc<dyn Array>, which cannot be converted to TypedArrayRef?
But we could potentially impl Deref for TypedArrayRef with Target = dyn Array, to cover e.g. is_valid and the various array-casting calls in the shredding code?
| let data_type = typed_value.data_type(); | ||
| let TypedArrayRef { | ||
| inner: typed_value_array, | ||
| field: typed_value_field, | ||
| } = typed_value; | ||
|
|
||
| let data_type = typed_value_array.data_type(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of all this churn, an impl Deref could make it transparent?
(see other comment)
| typed_value_field.try_extension_type(), | ||
| Ok(arrow_schema::extension::Uuid) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aside: It seems like we had a TODO somewhere, to check for an extension type without having to actually allocate it? Or is this check super cheap and we don't care?
| let Some(typed_value) = shredding_state.typed_value_field() else { | ||
| let Some(BorrowedTypedArrayRef { | ||
| inner: typed_value_array, | ||
| field: _typed_value_field, | ||
| }) = shredding_state.typed_value_field() | ||
| else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again, Deref seems like a less invasive option here?
| .with_field("metadata", Arc::new(metadata_array), false) | ||
| .with_field("value", Arc::new(value_array), true) | ||
| .with_field("typed_value", Arc::new(typed_value_struct), true) | ||
| .with_column_name("metadata", Arc::new(metadata_array), false) | ||
| .with_column_name("value", Arc::new(value_array), true) | ||
| .with_column_name("typed_value", Arc::new(typed_value_struct), true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a fair bit of churn from this rename, and I'm not certain the new name is accurate, given that it specifies more than just the column name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it actually bad to keep the original with_field name, given that its signature nearly matches Field::new? And the new method can keep its (also perfectly descriptive) name with_field_ref?
| typed_value: self.typed_value_field(), | ||
| typed_value: self | ||
| .typed_value_field() | ||
| .map(|TypedArrayRef { inner, field }| BorrowedTypedArrayRef { inner, field }), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this merit an impl From?
| VariantArray::try_new(&struct_array).expect("should create variant array"); | ||
|
|
||
| let res = variant_array.try_value(0); | ||
| assert!(res.is_err()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to assert the error message here?
| // // inspect the typed_value Field and make sure it contains the canonical Uuid extension type | ||
| // let typed_value_field = variant_array | ||
| // .inner() | ||
| // .fields() | ||
| // .into_iter() | ||
| // .find(|f| f.name() == "typed_value") | ||
| // .unwrap(); | ||
|
|
||
| // assert!( | ||
| // typed_value_field | ||
| // .try_extension_type::<extension::Uuid>() | ||
| // .is_ok() | ||
| // ); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to uncomment this check in current PR?
Which issue does this PR close?
FixedSizeBinary(16)shredding #8665Rationale for this change
This PR preserves the
typed_value'sFieldmetadata. This way, we can check for extension types.